Using Omissive Faults to Obtain Local Convergence in Partially Connected Networks
Authors: M. H. Azadmanesh and A. W. Krings
Abstract
Approximate Agreement is an important problem in fault-tolerant distributed computing, in which non-faulty processes exchange and vote upon their local values to arrive at values that are within the range of the initial values of the non-faulty processes and within a predefined tolerance of each other. Results to date in Approximate Agreement, however, cannot exploit omission faults: omission faults are presumed not to occur, a predefined default value is substituted for values not received, or the missing values are globally discarded before the voting algorithm executes. As a result, hybrid fault models cannot differentiate between omissive and transmissive faults. The performance and fault-tolerance expressions for completely connected networks in the presence of omission faults have recently been obtained. This paper develops a methodology that logically converts partially connected networks into completely connected networks. Hence, the results for completely connected systems can be applied to obtain the local convergence and fault-tolerance expressions for partially connected systems.

Digital computers are essential to critical applications such as aerospace systems, air traffic control systems, nuclear power systems, and computer manufacturing systems. Common to all of these applications is the demand for maximum reliability and high performance from computer components. This requirement is necessarily stringent because a single component failure in these applications can lead to disaster. Because of this requirement, fault-tolerant computing plays a significant part in the design of reliable and safe computers. One way of making these applications ultra-dependable is to employ hardware/software redundancy, which raises many issues. One is synchronization and coordination among different computer components to achieve the expected services.
Synchrony, in turn, involves the creation of algorithms which ensure that the good components stay in synchrony in spite of faulty ones. For example, many applications in distributed systems require the clocks of processors to be synchronized so that distributed events can be properly monitored and executed in the proper order. However, the clocks cannot stay in perfect harmony, as they cannot operate at exactly the same speed and the messages sent between processors incur uncertain delays. In such a situation, an Approximate Agreement algorithm can be used, where processors iteratively exchange their local clock values and vote until all non-faulty clocks converge to values within a prespecified range of each other. Agreement can easily be achieved if the system is fault-free, but it becomes very complex when faulty computers send wrong or even conflicting values to different computers.

Formally, Approximate Agreement (Dolev et al. 1983, 1986) is defined by the following conditions:

A1: AGREEMENT The voting algorithms executed by all non-faulty processes eventually halt with voted values that are within ε of each other.

A2: VALIDITY The voted value held by each non-faulty process is within the range of the initial values held by the non-faulty processes.

Many Approximate Agreement algorithms employ multiple rounds of message exchange. In each round, each process sends its value to all receiving processes. On receipt of a collection of values, each process executes an approximation function F to obtain its latest voted value, which is used in the next round of message exchange. The objective of Approximate Agreement can be achieved by ensuring that each round is convergent, i.e. the range of the correct values is reduced in each round. This property, called single-step convergence, guarantees that the range of values will eventually be less than ε, given enough rounds.
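The round structure described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: the approximation function F is taken to be a trimmed mean (drop the `tau` smallest and `tau` largest received values, then average), the network is completely connected, and the names `trimmed_mean`, `run_round`, `approximate_agreement`, and `two_faced` are hypothetical.

```python
from statistics import mean


def trimmed_mean(values, tau):
    """Assumed approximation function F: discard the tau lowest and
    tau highest values, then average what remains."""
    ordered = sorted(values)
    kept = ordered[tau:len(ordered) - tau]
    if not kept:
        raise ValueError("tau is too large for the number of values received")
    return mean(kept)


def spread(values):
    """Range of a collection of values (max minus min)."""
    return max(values) - min(values)


def run_round(correct_values, faulty_messages, tau):
    """One round in a completely connected system.

    correct_values[i] is the current value of non-faulty process i.
    faulty_messages[i] lists what the faulty senders delivered to process i.
    Every non-faulty process receives all correct values plus its own
    copy of the faulty messages, then applies F.
    """
    return [
        trimmed_mean(list(correct_values) + list(faulty_messages[i]), tau)
        for i in range(len(correct_values))
    ]


def approximate_agreement(initial, adversary, tau, epsilon, max_rounds=100):
    """Repeat rounds until the non-faulty values are within epsilon (A1)."""
    values = list(initial)
    rounds = 0
    while spread(values) >= epsilon and rounds < max_rounds:
        values = run_round(values, adversary(values), tau)
        rounds += 1
    return values, rounds


def two_faced(values):
    """Hypothetical asymmetric sender: a very low value to even-indexed
    receivers and a very high value to odd-indexed receivers."""
    return [[-100.0] if i % 2 == 0 else [100.0] for i in range(len(values))]


initial_values = [0.0, 4.0, 8.0]
final_values, rounds_used = approximate_agreement(
    initial_values, two_faced, tau=1, epsilon=0.01
)
```

With three non-faulty processes and one two-faced sender, trimming one value from each end removes the extreme faulty message at every receiver, so the range of the correct values shrinks in each round and the final values stay inside the range of the initial correct values.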
Section 2 gives the definitions for different failure modes. Section 3 describes the limitations of the existing voting algorithms and the motivation for this research. Section 4 introduces partially connected networks and their impact on convergence properties. Section 5 describes the impact of omissive faults on voting algorithms. It also shows how a partially connected system can logically look like a completely connected network. Section 6 defines two sub-families of algorithms called dynamic-a and fixed-a. Sections 7 and 8 show the convergence rate and fault tolerance for the two sub-families of algorithms. Section 9 provides an example to illustrate the process of determining whether convergence is possible, using the expressions obtained in the previous sections. Finally, Section 10 concludes the paper and comments on future research prospects.

2. FAULT MODE DEFINITIONS

Recent research has addressed convergent voting in the presence of multiple fault modes (Azadmanesh and Kieckhafer 1995, Kieckhafer and Azadmanesh 1993, 1994). This work uses the hybrid fault model of Thambidurai and Park (1988), which partitions faults into three modes: benign, symmetric, and asymmetric. Benign faults are defined as those which are self-incriminating or self-evident to all processes. A symmetric fault is defined as a fault whose value is perceived identically by all receiving non-faulty processes. An asymmetric fault is one capable of sending conflicting (arbitrary) messages to different non-faulty processes. Using this hybrid fault model, the total number of faults, comprising a asymmetric, s symmetric, and b benign faults, is t = a + s + b. Under this fault model, simple expressions were derived for the performance and fault tolerance of a broad family of convergent voting algorithms called Mean-Subsequence-Reduced (MSR) algorithms (Kieckhafer and Azadmanesh 1993, 1994).
Hybrid analysis of MSR produced more accurate bounds on the properties of the algorithms than is possible with any single-mode fault model. However, these algorithms, along with other traditional algorithms (Dolev et al. 1986, Kieckhafer and Azadmanesh 1994, Lamport and Melliar-Smith 1985, Meyer and Pradhan 1987, Thambidurai and Park 1988), cannot exploit the omission failure mode. An omission occurs when a process does not receive a value from a faulty process. These algorithms either assume that omissions do not occur or replace the omission with a predefined default value.

However, just as Byzantine faults were partitioned into asymmetric and symmetric modes, asymmetric and symmetric faults can each be further subdivided into transmissive and omissive modes. A transmissive fault occurs when one or more processes receive erroneous values. An omissive fault occurs when a faulty process does not deliver its value to one or more processes. An asymmetric fault can be either transmissive, i.e. a faulty process delivers conflicting values to all receiving processes, or simultaneously transmissive and omissive, i.e. a faulty process delivers a value to one or more processes and no value to others. Symmetric faults, on the other hand, are by definition either transmissive, i.e. the same erroneous value is delivered to all receiving processes, or omissive, i.e. no value is delivered to any process. Several failure modes can be classified as omissive faults, such as a crash fault or a fail-stop fault, where a process fails to transmit any messages, or a timing fault, where a process does not respond within the specified time frame (Cristian et al. 1985, 1986, 1989; Schneider 1984). In addition, with a modest amount of internal self-checking or with authenticated messages (Cristian et al. 1985, Wakerly 1978), locally diagnosed benign errors can be transformed into omissive errors, increasing the count of the latter dramatically.
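The taxonomy above can be made concrete by classifying a single sender from the viewpoint of an omniscient observer. The following is an illustrative sketch only: the function name `classify_sender`, the use of `None` to represent "no value received", and exact equality against a known correct value are all assumptions made for the example, and the labels follow the wording of the definitions above rather than any implementation from the paper.

```python
from typing import Optional, Sequence


def classify_sender(delivered: Sequence[Optional[float]], correct: float) -> str:
    """Classify one sender by what each non-faulty receiver obtained.

    delivered[i] is the value receiver i got from the sender, or None if
    receiver i got nothing (an omission). `correct` is the value a
    non-faulty sender would have delivered (assumed known here).
    """
    if not delivered:
        raise ValueError("at least one receiver is required")

    missing = [value is None for value in delivered]

    # No value delivered to any process: symmetric and omissive.
    if all(missing):
        return "symmetric omissive"

    # A value to some processes and no value to others.
    if any(missing):
        return "asymmetric transmissive and omissive"

    # Every receiver got a value from here on.
    if all(value == correct for value in delivered):
        return "non-faulty"

    # The same erroneous value delivered to all receivers.
    if len(set(delivered)) == 1:
        return "symmetric transmissive"

    # Conflicting values delivered to different receivers.
    return "asymmetric transmissive"


observations = {
    "crashed": [None, None, None],
    "partial": [7.0, None, 7.0],
    "healthy": [5.0, 5.0, 5.0],
    "stuck": [7.0, 7.0, 7.0],
    "two_faced": [3.0, 7.0, 5.0],
}
labels = {name: classify_sender(seen, correct=5.0) for name, seen in observations.items()}
```

In a running system no process has this global view; the sketch is only meant to show how the four combinations of symmetric/asymmetric and transmissive/omissive differ in what receivers observe.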
Similar resources
Data aggregation in partially connected networks
With the diverse new capabilities that sensor and ad-hoc networks can provide, applicability of data aggregation is growing. Data aggregation is useful in dealing with multi-value domain information, which often requires approximate agreement decisions among nodes. In contrast to fully connected networks, the research on data aggregation for partially connected networks is very limited. This is...
Exploiting Markov Chains to Reach Approximate Agreement in Partially Connected Networks
The research in reaching Approximate Agreement (AA) for fully connected networks is relatively mature. In contrast, a literature survey of the AA problem for partially connected networks reveals considerably less work. This is due to the fact that a node may not have a complete view of the global network, which makes it difficult to attain the convergence properties. The complexity of t...
Network Convergence in the Presence of Omission Faults
Network convergence in the presence of various fault modes has been studied extensively for completely connected networks. However, complete connectivity is impractical for large distributed systems. In addition, to attain network convergence for partially connected systems, most research presumes message relays by intervening nodes. In other words, the network is logically converted to a comple...
Uniform connectedness and uniform local connectedness for lattice-valued uniform convergence spaces
We apply Preuss' concept of $\mathbb{E}$-connectedness to the categories of lattice-valued uniform convergence spaces and of lattice-valued uniform spaces. A space is uniformly $\mathbb{E}$-connected if the only uniformly continuous mappings from the space to a space in the class $\mathbb{E}$ are the constant mappings. We develop the basic theory for $\mathbb{E}$-connected sets, including the product theorem. Furtherm...
Data Aggregation in Multi-Agent Systems in the Presence of Hybrid Faults
Data Aggregation (DA) is a set of functions that provide components of a distributed system access to global information for purposes of network management and user services. With the diverse new capabilities that networks can provide, applicability of DA is growing. DA is useful in dealing with multi-value domain information and often requires the agents to exchange messages with the others to...
Publication date: 2013